The goal of our project is to develop a recommendation system trained on Amazon beauty product data. Some studies report that personalized product recommendations drive 24% of orders and 26% of revenue, which illustrates the influence recommendations have on order volume and on sales figures in general. What is more, product recommendations are reported to lead to recurring visits, and purchases made through recommendations show a higher average order value. Consequently, we decided to use a method called user-based collaborative filtering to build our recommendation system (Reference).
First, we proceed with data preparation and pre-processing, then we build our recommender system, and finally draw business implications.
As mentioned earlier, we use data on Amazon customer reviews of beauty products. The data used in this project can be accessed at this link. It contains the following features:
#Packages
library(R.utils)
library(dplyr)
library(tidyr)
library(janitor)
library(recommenderlab)
library(tm)
library(NLP)
library(qdap)
library(readr)
library(wordcloud)
After downloading the data locally, we load it using the read_lines() function:
# Loading in data
#my_data <- readLines(gzfile("data/Beauty.txt.gz"))
txt <- gzfile("data/Beauty.txt.gz")
my_data <- read_lines(txt,1)
Let us first have a look at the dimension of our data. Our data set is currently in the form of a single vector with 2772616 elements. Obviously, this is not the optimal form to work with, so we need to reshape the data set to make it more convenient for further analysis.
What we can do first is to remove all empty elements:
my_data <- my_data[nzchar(my_data)]
Then we can convert it to a data frame:
my_data <- as.data.frame(my_data)
colnames(my_data) <- "product"
One of the critical steps is separating the single column into two columns:
# Separate one column into two at the first ":" (extra = "merge" keeps any later colons inside the value)
my_data <- separate(my_data, col = product, into = c("Info", "Product"), sep = ":", extra = "merge")
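Because review text and product titles can themselves contain colons, it matters how separate() treats the extra pieces. The following is a minimal sketch on two made-up lines (the data frame toy and its values are illustrative, not taken from the data set) comparing the default behaviour with extra = "merge":

```r
library(tidyr)

# Made-up record lines in the same "key: value" shape as the data set
toy <- data.frame(
  product = c("product/title: Shampoo: Travel Size", "review/score: 5.0"),
  stringsAsFactors = FALSE
)

# Default handling keeps only the first two pieces (and warns),
# so the title loses everything after its second colon
dropped <- suppressWarnings(
  separate(toy, col = product, into = c("Info", "Product"), sep = ":")
)

# extra = "merge" splits only at the first colon and keeps the rest intact
merged <- separate(toy, col = product, into = c("Info", "Product"),
                   sep = ":", extra = "merge")

dropped$Product[1]
merged$Product[1]
```

In both cases the value keeps its leading space, which is why trimming white space is still needed afterwards.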
Inspecting the first 10 rows:
head(my_data,10)
The data set is loaded in .txt format, which makes it a bit challenging to work with. In the following sections we will manipulate the data in order to bring it into a more suitable form.
First, we will convert it from the current long format to a wide format, where each row represents a feature and each column a review record:
#Converting long format to wide
my_data <- my_data %>%
group_by(Info) %>%
mutate(Order = seq_along(Info)) %>%
spread(key = Order, value = Product)
Next we transpose the result so that each row is a review. Since the column names are then just numbers, we use the first row as the column names:
my_data <- as.data.frame(t(my_data))
my_data<-my_data%>%
row_to_names(row_number = 1)
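The long-to-wide reshaping can be tried out on a few made-up rows. The sketch below uses pivot_wider(), the newer tidyr replacement for spread(), which is a substitution on our part rather than the code used above; the names toy and wide and all values are illustrative.

```r
library(dplyr)
library(tidyr)

# Two made-up reviews in long "Info / Product" form
toy <- data.frame(
  Info = c("product/productId", "review/userId", "review/score",
           "product/productId", "review/userId", "review/score"),
  Product = c("B001", "U1", "5.0", "B002", "U2", "3.0"),
  stringsAsFactors = FALSE
)

wide <- toy %>%
  group_by(Info) %>%
  mutate(Order = row_number()) %>%   # n-th occurrence of each field = n-th review
  ungroup() %>%
  pivot_wider(names_from = Info, values_from = Product)

wide
```

With this variant each row is already one review and the field names are already the column names, so no transpose is needed. Like the original approach, it assumes every review contains every field exactly once and in the same order.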
Delete rows with at least one NA:
my_data <- my_data[rowSums(is.na(my_data))==0,]
Trim white space at the beginning and end of the strings:
my_data$`review/userId`<- trimws(my_data$`review/userId`)
my_data$`product/productId`<- trimws(my_data$`product/productId`)
my_data$`product/price`<- trimws(my_data$`product/price`)
my_data$`product/title`<- trimws(my_data$`product/title`)
Filtering out reviews with an unknown user ID, product ID or price:
my_data<-filter(my_data,`review/userId`!="unknown" & `product/productId`!="unknown" & `product/price`!="unknown")
Correcting column classes:
my_data$`product/productId` <- as.factor(my_data$`product/productId`)
my_data$`review/score`<- as.numeric(my_data$`review/score`)
my_data$`review/userId`<-as.factor(my_data$`review/userId`)
my_data$`product/price`<-as.numeric(my_data$`product/price`)
In order to use relevant data, we need to define the minimum number of reviews per user. The majority of users left only one review, so we will remove all single-review users and all other users who left fewer than 9 reviews.
Keeping only users who left 9 or more reviews:
freq<-as.data.frame(table(my_data$`review/userId`))
index <- filter(freq, Freq >= 9)$Var1
We are now left with 1316 users who each left at least 9 reviews of beauty products.
(my_data <- subset(my_data,`review/userId` %in% index))
head(my_data)
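The same kind of filter can be written in a single dplyr pipeline. This is a minimal sketch on made-up data; the names toy, min_reviews and kept are illustrative, and the threshold of 2 is chosen only to keep the example small.

```r
library(dplyr)

# Made-up reviews: user "a" has 3, "b" has 1, "c" has 2
toy <- data.frame(
  user  = c("a", "a", "a", "b", "c", "c"),
  score = c(5, 4, 3, 5, 2, 1),
  stringsAsFactors = FALSE
)

min_reviews <- 2  # illustrative threshold; the analysis above uses 9

# Keep every row that belongs to a user with at least min_reviews rows
kept <- toy %>%
  group_by(user) %>%
  filter(n() >= min_reviews) %>%
  ungroup()

kept
```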
length(unique(my_data$`product/productId`))
[1] 4432
There are 4432 products which were reviewed.
length(unique(my_data$`review/userId`))
[1] 1316
There are 1316 unique reviewers/customers who reviewed products.
length(my_data$`review/score`)
[1] 21970
There are 21970 ratings.
hist(as.numeric(my_data$`review/score`),main = "Histogram of scores",xlab = "Score")
Products seem to be favorably rated, as the distribution of scores shows that the best score is the most frequent.
my_data %>%
group_by(`review/userId`) %>%
summarise(Freq=n()) %>%
summary()
               review/userId       Freq
 A10412572BPZJM6QSB69S:   1   Min.   :  9.00
 A104D32SF6TX7F       :   1   1st Qu.: 10.00
 A10IQD569MWNGU       :   1   Median : 15.00
 A1115ST6F5CWYP       :   1   Mean   : 16.69
 A111Z6YLF7VARM       :   1   3rd Qu.: 18.00
 A112JF58KKB8LP       :   1   Max.   :561.00
 (Other)              :1310
In the original data set, users left on average only one review. After filtering, we see that the average is about 16.7 reviews per user, with a median of 15.
(grand.mean <-my_data %>%
group_by(`review/userId`) %>%
dplyr::summarise(Mean=mean(`review/score`)) %>%
mutate(Grand.mean=mean(Mean))%>%
head())
It seems that beauty products on Amazon are well received by users as the average score per user is quite high, at 4.1956189.
Here is a glimpse of our data before we start building the recommender:
head(my_data)
In order to model a recommender system, three variables are of great importance in our case:
review/userId - the ID of the user who wrote the review
product/productId - the ID of the reviewed product
review/score - the rating the user gave to the product
Our model will be based on these three variables. Additionally, we will make use of the remaining features by applying some text mining techniques, which are covered later on. Now, we will make a subset of our data with the three variables mentioned:
subset_my_data <- subset(my_data, select = c(`review/userId`,`product/productId`,`review/score`))
head(subset_my_data)
Let us inspect the dimensions:
dim(subset_my_data)
[1] 21970 3
Our data is currently in the long format, i.e. one row per rating. However, we want a matrix of ratings in which the rows represent the user IDs and the columns the product IDs. Thus, we will transform our data into a so-called rating matrix:
ratings <- as(subset_my_data, "realRatingMatrix")
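To make the shape of a rating matrix concrete, here is a minimal base R sketch on made-up ratings. It uses tapply() and a dense matrix purely for illustration; the names long and m are hypothetical, and the realRatingMatrix above stores the same information in sparse form.

```r
# Made-up long-format ratings: one row per (user, product, score)
long <- data.frame(
  user    = c("u1", "u1", "u2", "u3"),
  product = c("p1", "p2", "p1", "p3"),
  score   = c(5, 3, 4, 2),
  stringsAsFactors = FALSE
)

# Users become rows, products become columns; unrated pairs are NA
m <- with(long, tapply(score, list(user, product), mean))

m
```

Only 4 of the 9 cells hold a rating; the rest are NA, which is the pattern that makes real rating matrices sparse.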
In order to avoid the “high/low rating bias” from users who give high (or low) ratings to all the products they reviewed, we need to normalize our data. By default, normalize() centers each user's ratings by subtracting that user's mean rating:
ratings <- normalize(ratings)
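Centering by the user mean can be reproduced by hand on a small dense matrix. This is a sketch of the idea only, on made-up ratings (the names r, user_means and centered are illustrative), not a reimplementation of the package function.

```r
# Made-up ratings: u1 rates generously, u2 rates harshly; NA = not rated
r <- matrix(c(5,  4, NA,
              2, NA,  1),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("u1", "u2"), c("p1", "p2", "p3")))

# Mean rating of each user over the products they actually rated
user_means <- rowMeans(r, na.rm = TRUE)

# Subtract each user's mean from that user's row
centered <- sweep(r, 1, user_means, "-")

centered
```

After centering, a 5 from the generous user and a 2 from the harsh user both become +0.5, so they count as the same relative preference.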
We can plot an image of the rating matrix for the first 250 users and 250 products:
image(ratings[1:250,1:250])
From the visualisation we can see that the rating matrix is very sparse: most users rated only a small fraction of the products in our data set.
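The degree of sparsity can be estimated from the counts reported earlier (1316 users, 4432 products, 21970 ratings). The arithmetic below assumes that each rating belongs to a distinct user-product pair; repeated reviews of the same product by the same user would make the matrix even sparser.

```r
n_users    <- 1316
n_products <- 4432
n_ratings  <- 21970

# Share of user-product cells that hold a rating
density <- n_ratings / (n_users * n_products)

round(100 * density, 2)  # percentage of filled cells
```

Under this assumption fewer than 1 in 250 cells of the matrix is filled.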
We can inspect the data for the first 10 users and the first 4 products:
ratings[1:10, 1:4]@data
10 x 4 sparse Matrix of class "dgCMatrix"
                      B00004RF1H B00004U9UY B000050B6X B000050B6Y
A10412572BPZJM6QSB69S          .          .          .          .
A104D32SF6TX7F                 .          .          .          .
A10IQD569MWNGU                 .          .          .          .
A1115ST6F5CWYP                 .          .          .          .
A111Z6YLF7VARM                 .          .          .          .
A112JF58KKB8LP                 .          .          .          .
A1159DQXCJXDNN                 .          .          .          .
A117GF5NSKVZ55                 .          .          .          .
A1194J1H29WSV                  .          .          .          .
A11B8JNLONAAPU                 .          .          .          .
As we already saw in the visualisation, the data is sparse: the first 10 users did not review any of the first 4 products shown in the matrix above.
Finally, we will now build our recommendation system based on user-based collaborative filtering. User-based collaborative filtering searches for similar users and gives recommendations based on what other users with similar rating patterns appreciated:
recommender <- Recommender(ratings, method="UBCF")
recommender
Recommender of type ‘UBCF’ for ‘realRatingMatrix’
learned using 1316 users.
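The idea behind user-based collaborative filtering can be sketched by hand on a tiny made-up matrix. This is a simplified illustration under our own assumptions (cosine similarity over co-rated products, all other users as neighbours, raw rather than normalized ratings); the package implementation adds details such as a neighbourhood size and normalization, and the names r, cosine_sim, sims and pred are hypothetical.

```r
# Made-up ratings: rows are users, columns are products, NA = not rated
r <- matrix(c(5,  4, NA,  1,
              5,  5,  4, NA,
              1, NA,  2,  5),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("u1", "u2", "u3"), c("p1", "p2", "p3", "p4")))

# Cosine similarity computed over the products both users rated
cosine_sim <- function(a, b) {
  both <- !is.na(a) & !is.na(b)
  if (!any(both)) return(NA_real_)
  sum(a[both] * b[both]) / (sqrt(sum(a[both]^2)) * sqrt(sum(b[both]^2)))
}

target <- "u1"
others <- setdiff(rownames(r), target)

# Similarity of every other user to the target user
sims <- sapply(others, function(u) cosine_sim(r[target, ], r[u, ]))

# Predict u1's rating for p3 as the similarity-weighted mean
# of the ratings that the other users gave to p3
item  <- "p3"
rated <- !is.na(r[others, item])
pred  <- sum(sims[rated] * r[others[rated], item]) / sum(sims[rated])

sims
pred
```

Here u2 rates much like u1 and u3 does not, so the prediction for p3 is pulled towards u2's rating of 4 rather than u3's rating of 2.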
Additionally, in order to compare results of two methods, we would like to apply item-based collaborative filtering method to build another recommender system. In contrast to user-based collaborative filtering, item-based collaborative filtering looks for similarity patterns between items and recommends them to users based on the computed information.
recommenderIBCF <- Recommender(ratings, method="IBCF")
recommenderIBCF
As reported, both recommendation systems are built using 1316 users.
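The item-based variant can be sketched in the same hand-rolled way. Again this is an illustration under our own assumptions (cosine similarity between product columns, all rated products used as neighbours), not the package implementation, and every name in it is hypothetical.

```r
# Made-up ratings: rows are users, columns are products, NA = not rated
r <- matrix(c(5,  4, NA,  1,
              5,  5,  4, NA,
              1, NA,  2,  5),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("u1", "u2", "u3"), c("p1", "p2", "p3", "p4")))

# Cosine similarity computed over the users who rated both products
cosine_sim <- function(a, b) {
  both <- !is.na(a) & !is.na(b)
  if (!any(both)) return(NA_real_)
  sum(a[both] * b[both]) / (sqrt(sum(a[both]^2)) * sqrt(sum(b[both]^2)))
}

target_user <- "u1"
item        <- "p3"

# Products the target user has already rated
rated_items <- colnames(r)[!is.na(r[target_user, ])]

# Similarity of the unrated product to each product the user rated
sims <- sapply(rated_items, function(j) cosine_sim(r[, item], r[, j]))

# Predict the rating as the similarity-weighted mean of the user's own ratings
pred <- sum(sims * r[target_user, rated_items]) / sum(sims)

sims
pred
```

Two of the three similarities equal exactly 1 because those product pairs share only a single rater. With very sparse data such overlaps are rare or uninformative, which is one plausible reason an item-based recommender can struggle to produce recommendations.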
Now we would like to interpret the output of our recommender systems. First we start with UBCF-based recommender system.
current.user <- 45
recommendations <- predict(recommender, current.user, data = ratings, n = 5)
We decided to take user number 45 and inspect 5 recommendations provided to him/her. Now we can inspect what our recommendation system provided in the end:
str(recommendations)
Formal class 'topNList' [package "recommenderlab"] with 4 slots
..@ items :List of 1
.. ..$ A14E1HUV4A2ILV: int [1:5] 1008 1010 1012 1025 1026
..@ ratings :List of 1
.. ..$ A14E1HUV4A2ILV: num [1:5] 5 5 5 5 5
..@ itemLabels: chr [1:4432] "B00004RF1H" "B00004U9UY" "B000050B6X" "B000050B6Y" ...
..@ n : int 5
We can see that the user ID of user number 45 is A14E1HUV4A2ILV. Our system found 5 products to recommend to this user; the output shows their product indices (1008, 1010, 1012, 1025, 1026) as well as the ratings that the system calculated from the ratings of the closest users (5 for each product).
Let us create a prediction made by IBCF-based recommender:
recommendationsIBCF <- predict(recommenderIBCF,current.user,data = ratings, n=5)
str(recommendationsIBCF)
We will inspect potential recommended products:
head(as(recommendationsIBCF,"list"))
Unfortunately, our item-based collaborative filtering system did not generate any recommendation for the user number 45.
Let us now identify the products recommended by the UBCF-based recommender. First we need to extract the IDs of the recommended products:
index <- unlist(as(recommendations, "list"), use.names = FALSE)
Then we find the corresponding products in our initial data set:
(recommendation_26<-my_data[match(index, my_data$`product/productId`),])
The rows printed by this call show the recommended products.
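When looking up products this way, it helps to remember that match() returns only the first matching row for each ID, whereas %in% keeps every matching row. A minimal sketch on made-up reviews (all names and values are illustrative):

```r
# Made-up reviews; product B001 was reviewed twice
reviews <- data.frame(
  productId = c("B001", "B002", "B001", "B003"),
  title     = c("Nail polish", "Hair dye", "Nail polish", "Gloves"),
  stringsAsFactors = FALSE
)

recommended <- c("B001", "B003")

# One row per recommended product: the first review found for each ID
first_rows <- reviews[match(recommended, reviews$productId), ]

# Every review of the recommended products
all_rows <- reviews[reviews$productId %in% recommended, ]

nrow(first_rows)
nrow(all_rows)
```

The first form is convenient for listing product titles once; the second is the one to use when all reviews of a product, or all reviews by a user, are needed.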
Let us now inspect the products that this user rated:
my_data[my_data$`review/userId` == rownames(ratings)[current.user], ]
Among the products this user reviewed is “Opi Ridge Filler .5 oz.”, a nail-care product. Such products are typically marketed to women, and the recommended products are likewise strongly associated with women's beauty care. The user's name (Erica) points in the same direction, although neither signal is conclusive. From a qualitative perspective, it seems that our recommendation system provides decent recommendations.
In addition to our recommender system, we will apply some basic text mining techniques to explore the review text. Text mining helps us to mine users' opinions about the reviewed products at scale.
Here we create a word cloud from the reviews of the products recommended to user 45. Beforehand, we need to pre-process the review text in the following manner:
# Build a corpus from the review text of the recommended products:
text.docs <- Corpus(VectorSource(recommendation_26$`review/text`))
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
text.docs <- tm_map(text.docs, toSpace, "/")
text.docs <- tm_map(text.docs, toSpace, "@")
text.docs <- tm_map(text.docs, toSpace, "\\|")
text.docs <- tm_map(text.docs, content_transformer(tolower))
text.docs <- tm_map(text.docs, removeNumbers)
text.docs <- tm_map(text.docs, stripWhitespace)
text.docs <- tm_map(text.docs, removeWords, stopwords("english"))
text.docs <- tm_map(text.docs, removePunctuation)
dtm <- DocumentTermMatrix(text.docs, control=list(weighting=weightTf))
m <- as.matrix(t(dtm))
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 10,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
From the word cloud we can see that the words “color”, “hair” and “gloves” are quite frequent in the text corpus analyzed. That could be a hint that reviewers were referring to how the product is used. The term “cheap” can be easily spotted as well. Marketers tend to dislike this word as it can lend an unfavorable image to the brand. Nevertheless, it suggests that reviewers consider the product affordable.
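The word frequencies that drive a word cloud can be reproduced in base R on a couple of made-up reviews. This sketch mirrors the pre-processing steps in spirit only: the stop-word list is a tiny illustrative one rather than the tm list, and the names reviews, stop_words, tokens and freq are hypothetical.

```r
# Two made-up reviews
reviews <- c("Great color, great price!",
             "The color faded after one wash.")

# Tiny illustrative stop-word list (an assumption, not the tm list)
stop_words <- c("the", "after", "one")

clean  <- tolower(reviews)                 # lower-case everything
clean  <- gsub("[[:punct:]]", " ", clean)  # replace punctuation with spaces
tokens <- unlist(strsplit(clean, "\\s+"))  # split on white space
tokens <- tokens[nzchar(tokens) & !(tokens %in% stop_words)]

# Term frequencies, most frequent first
freq <- sort(table(tokens), decreasing = TRUE)

freq
```

The resulting table plays the same role as the data frame d above: one frequency per word, which the word cloud then maps to font size.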
This data set offers several possibilities for further analysis besides recommender systems. Here are some ideas for what could be done next:
Sentiment analysis - Sentiment analysis could be performed and a score (typically from -3 to +3) assigned to each review text. That would tell us more about the sentiment that users have about the products reviewed.
Prediction of ratings - Given enough data (ratings) about one product, regardless of customers, it would be possible to develop a machine learning model that predicts the rating a product might receive, based on current features (e.g. price) and additional features (such as sentiment or words in the review).
Prediction of the sentiment - In a similar manner to the previous point, it would be useful to train a machine learning model to predict the sentiment a reviewer is likely to express.
Topic modeling - Topic modeling is an unsupervised machine learning technique that could help us identify the topics which users discuss in the text of reviews.
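As a rough idea of what the sentiment analysis suggestion could look like, the sketch below scores reviews against a small lexicon. The lexicon, its scores on a -3 to +3 scale, and the function score_review() are all made up for illustration; a real analysis would rely on an established sentiment lexicon.

```r
# Hypothetical mini-lexicon with scores between -3 and +3
lexicon <- c(great = 2, love = 3, cheap = -1, broke = -2, terrible = -3)

# Average lexicon score of the words in a review (0 if no word matches)
score_review <- function(text) {
  words <- unlist(strsplit(tolower(gsub("[[:punct:]]", " ", text)), "\\s+"))
  hits  <- lexicon[words[words %in% names(lexicon)]]
  if (length(hits) == 0) 0 else mean(hits)
}

score_review("Great color, love it!")
score_review("Cheap and it broke.")
score_review("Arrived on Tuesday.")
```

Scores like these could then serve as an additional feature for the rating prediction idea listed above.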
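A limitation of this data set for building a recommender system is that the majority of users left only one review. In the table below, which counts users by their number of reviews in the filtered data, the 0 column holds the 143454 users who were removed by the 9-review threshold (their factor levels remain in review/userId):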
table(as.data.frame(table(my_data$`review/userId`))$Freq)
     0      9     10     11     12     13     14     15     16     17     18     19
143454    238    111     66     65     30     35    260    135     45     31     29
    20     21     22     23     24     25     26     27     28     29     30     31
    24     26     13     15     82     18      9      5      6      3     10      3
    32     34     35     36     37     38     39     41     43     45     46     47
     3      2      6      7      4      5      4      2      2      4      2      1
    48     49     50     53     59     62     64     75     85     86    158    205
     1      1      1      1      2      1      1      1      2      1      1      1
   561
     1
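A count of 0 appears in such a table because subsetting a factor keeps all of its original levels. The sketch below shows the effect on a made-up factor and how droplevels() removes the unused levels; the names users and kept are illustrative.

```r
# Made-up user IDs stored as a factor with three levels
users <- factor(c("a", "a", "b", "c"))

# Keep only the reviews of user "a"; levels "b" and "c" remain defined
kept <- users[users == "a"]

table(kept)              # "b" and "c" are still listed, with a count of 0
table(droplevels(kept))  # only levels that actually occur
```

Applying droplevels() to the user ID column after filtering would restrict such a frequency table to the users who are still in the data.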
Let us take a look at which users left the most reviews:
limitations <-as.data.frame(table(my_data$`review/userId`))
limitations %>% arrange(desc(Freq))%>%rename(UserID=Var1)%>% head()
We can see that the users with the IDs A3M174IC0VXOS2, A3KEZLJ59C1JVH and A3QEE0ZPMT3W6P are rare examples of users who left many product reviews.